Author: Charles Tapley Hoyt
Estimated Run Time: 5 minutes
This notebook outlines the process of generating unbiased candidate mechanisms and comparing them to the dogmatic mechanisms in the NeuroMMSig Knowledge Base.
In [1]:
import logging
import itertools as itt
import os
import time
from collections import defaultdict
from operator import itemgetter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib_venn import venn2
import pybel
import pybel_tools as pbt
from pybel.canonicalize import calculate_canonical_name
from pybel.constants import *
from pybel_tools.visualization import to_jupyter
In [2]:
#%config InlineBackend.figure_format = 'svg'
%matplotlib inline
In [3]:
time.asctime()
Out[3]:
In [4]:
# seed the random number generator
import random
random.seed(127)
In [5]:
pybel.__version__
Out[5]:
In [6]:
pbt.__version__
Out[6]:
To make this notebook interoperable across machines, the locations of the repositories containing the data used in this notebook are read from environment variables, set in ~/.bashrc
to point to the place where the repositories have been cloned. Assuming the repositories have been git clone
'd into the ~/dev
folder, the entries in ~/.bashrc
should look like:
...
export BMS_BASE=~/dev/bms
...
The biological model store (BMS) is the internal Fraunhofer SCAI repository for keeping BEL models under version control. It can be downloaded from https://tor-2.scai.fraunhofer.de/gf/project/bms/
In [7]:
bms_base = os.environ['BMS_BASE']
The Alzheimer's Disease knowledge assembly has been precompiled with the following command line script, and will be loaded from this format for improved performance. In general, derived data, such as the gpickle representation of a BEL script, are not saved under version control to ensure that the most up-to-date data is always used.
pybel convert --path "$BMS_BASE/aetionomy/alzheimers.bel" --pickle "$BMS_BASE/aetionomy/alzheimers.gpickle"
The BEL script can also be compiled from inside this notebook with the following python code:
>>> import os
>>> import pybel
>>> # Input from BEL script
>>> bel_path = os.path.join(bms_base, 'aetionomy', 'alzheimers.bel')
>>> graph = pybel.from_path(bel_path)
>>> # Output to gpickle for fast loading later
>>> pickle_path = os.path.join(bms_base, 'aetionomy', 'alzheimers.gpickle')
>>> pybel.to_pickle(graph, pickle_path)
In [8]:
pickle_path = os.path.join(bms_base, 'aetionomy', 'alzheimers', 'alzheimers.gpickle')
In [9]:
graph = pybel.from_pickle(pickle_path)
In [10]:
graph.version
Out[10]:
The unique values for the subgraph annotation are extracted and counted.
In [11]:
subgraph_names = sorted(pbt.summary.get_annotation_values(graph, 'Subgraph'))
In [12]:
len(subgraph_names)
Out[12]:
The biological process nodes are retrieved with pbt.filters.get_nodes_by_function and counted.
In [13]:
bioprocess_nodes = sorted(pbt.filters.get_nodes_by_function(graph, BIOPROCESS))
In [14]:
len(bioprocess_nodes)
Out[14]:
The BEL graph object is split to multiple BEL graph objects representing each of the subgraphs with pbt.selection.get_subgraphs_by_annotation.
In [15]:
subgraphs = pbt.selection.get_subgraphs_by_annotation(graph, 'Subgraph')
Asking which biological processes appear in multiple subgraphs allows us to investigate their upstream controllers. In a knowledge discovery scenario, this could provide evidence of cross-talk.
In [16]:
bp2sg = defaultdict(set)
for name, subgraph in subgraphs.items():
for bp in pbt.filters.get_nodes_by_function(subgraph, BIOPROCESS):
bp2sg[bp].add(name)
bp2sg = dict(bp2sg)
bp2sg_counts = pbt.utils.count_dict_values(bp2sg)
In [17]:
sg2bp = {
name: set(pbt.filters.get_nodes_by_function(subgraph, BIOPROCESS))
for name, subgraph in subgraphs.items()
}
sg2bp_counts = pbt.utils.count_dict_values(sg2bp)
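The semantics of pbt.utils.count_dict_values are not shown in this notebook; a minimal sketch, assuming it maps each key to the size of its value collection and returns a Counter, might look like:

```python
from collections import Counter

def count_dict_values(d):
    """Map each key in a dict of collections to the size of its collection.

    A hypothetical stand-in for ``pbt.utils.count_dict_values``.
    """
    return Counter({key: len(values) for key, values in d.items()})

# Example: two biological processes in subgraph 'a', one in 'b'
counts = count_dict_values({'a': {'bp1', 'bp2'}, 'b': {'bp1'}})
```

Under this assumption, counts.most_common() orders subgraphs by their number of member biological processes, which is how sg2bp_counts is used later in this notebook.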
In [18]:
fig, (lax, rax) = plt.subplots(1, 2, figsize=(10, 3))
lax.set_title('Promiscuity of Biological Processes')
lax.set_ylabel('Frequency')
lax.set_xlabel('Subgraphs in which a biological process appears')
lax.hist(list(bp2sg_counts.values()), log=True)
rax.set_title('Subgraph Specificity')
rax.set_ylabel('Frequency')
rax.set_xlabel('Member biological processes of a subgraph')
rax.hist(list(sg2bp_counts.values()), log=True)
plt.tight_layout()
plt.savefig(os.path.join(os.path.expanduser('~'), 'Desktop', 'sg_comparison.pdf'))
plt.show()
The five biological processes appearing in the most subgraphs are shown below.
In [19]:
for bp, count in bp2sg_counts.most_common(5):
print('{:2} {:6} {}'.format(count, graph.node[bp][NAMESPACE], graph.node[bp][NAME]))
The five subgraphs containing the most biological processes are shown below.
In [20]:
for name, count in sg2bp_counts.most_common(5):
print('{:2} {}'.format(count, name))
In [21]:
sg_bp_membership = defaultdict(dict)
sg_bp_membership_bool = defaultdict(dict)
for name, bp in itt.product(subgraph_names, bioprocess_nodes):
sg_bp_membership_bool[name][bp] = bp in sg2bp[name]
sg_bp_membership[name][bp] = 1 if bp in sg2bp[name] else 0
sg_bp_membership = dict(sg_bp_membership)
sg_bp_membership_bool = dict(sg_bp_membership_bool)
Each subgraph annotation in the NeuroMMSig Database comes from a specific domain of study. While these subgraphs attempt to describe groups of related, dogmatic pathways and are helpful for communicating ideas, they are a discretization of the continuous and inseparable biological system. In this section, we will generate more, smaller candidate subgraphs with an unbiased approach in the hope of providing a more thorough overview of the individual mechanisms in the holistic system and their interplay, or "cross-talk". Later, we'll use that information to assess the overlap between dogmatic mechanisms and bridge the gaps between them.
The candidate mechanisms themselves are generated with pbt.generation.generate_bioprocess_mechanisms by expanding to the upstream controllers around each biological process.
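The exact expansion logic lives in pybel_tools; a minimal sketch of the underlying idea, assuming a candidate mechanism is the set of nodes within a fixed number of upstream (predecessor) hops of a biological process, could be:

```python
def upstream_nodes(predecessors, target, hops=2):
    """Collect the nodes within ``hops`` upstream steps of ``target``.

    ``predecessors`` maps each node to the set of its direct upstream
    controllers. This is a hypothetical simplification of
    pbt.generation.generate_bioprocess_mechanisms, which operates on
    the full BEL graph.
    """
    nodes = {target}
    frontier = {target}
    for _ in range(hops):
        frontier = {u for v in frontier for u in predecessors.get(v, set())}
        nodes |= frontier
    return nodes

# Toy example: a -> b -> bp, and c -> bp
preds = {'bp': {'b', 'c'}, 'b': {'a'}}
mechanism_nodes = upstream_nodes(preds, 'bp')
```

With two hops, the toy example pulls in both the direct controllers of the biological process and their own controllers.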
In [22]:
%%time
candidate_mechanisms = pbt.generation.generate_bioprocess_mechanisms(graph)
In [23]:
plt.title('Histogram of Candidate Mechanism Sizes')
plt.xlabel('Number Nodes')
plt.ylabel('Frequency')
plt.hist([cm.number_of_nodes() for cm in candidate_mechanisms.values()], log=True)
plt.show()
In [24]:
%%time
sg_bp_overlap = defaultdict(dict)
for name, bp in itt.product(subgraphs, bioprocess_nodes):
x = set(pbt.filters.get_nodes_by_function(candidate_mechanisms[bp], {PROTEIN, BIOPROCESS}))
y = set(subgraphs[name].nodes_iter())
sg_bp_overlap[name][bp] = pbt.utils.set_percentage(x, y)
sg_bp_overlap = dict(sg_bp_overlap)
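The pbt.utils.set_percentage function is used here as the similarity measure; judging by the axis label on the following plot, it is the Tversky similarity with α = 1 and β = 0, i.e. the fraction of the candidate mechanism's nodes that are covered by the subgraph. A minimal sketch under that assumption:

```python
def set_percentage(x, y):
    """Tversky similarity with alpha=1, beta=0: |X & Y| / |X|.

    A hypothetical stand-in for ``pbt.utils.set_percentage``.
    """
    x, y = set(x), set(y)
    if not x:
        return 0.0
    return len(x & y) / len(x)

set_percentage({1, 2, 3, 4}, {1, 2})  # half of x is covered by y
```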
In [25]:
overlap_df = pd.DataFrame(sg_bp_overlap)
overlap_df.to_csv(os.path.expanduser('~/Desktop/subgraph_comparison.csv'))
In [26]:
plt.title('A Histogram of Pairwise Overlap')
plt.xlabel(r'Tversky Similarity ($\alpha = 1$ and $\beta = 0$)')
plt.ylabel('Frequency')
plt.hist(overlap_df.values.ravel(), log=True)
plt.show()
The landscape of overlaps between unbiased candidate mechanisms and dogmatic subgraphs is shown below.
In [27]:
overlap_matrix = overlap_df.values
cg = sns.clustermap(overlap_matrix, figsize=(10, 7))
In [28]:
cg.savefig(os.path.join(os.path.expanduser('~'), 'Desktop', 'sg_overlaps.pdf'))
plt.show()
The overlap similarity matrix is discretized with a cutoff and displayed below.
In [29]:
overlap_cutoff = 0.3
In [30]:
overlap_matrix_discrete = overlap_matrix > overlap_cutoff
In [31]:
sns.clustermap(overlap_matrix_discrete, figsize=(10, 7))
plt.show()
In [32]:
#: Keeps track of the biological processes that have large overlap with a dogmatic mechanism but were not annotated to it
sg_added = {}
sg_bp_cutoff_member = defaultdict(dict)
for name in subgraph_names:
sg_added[name] = set()
for bp in bioprocess_nodes:
concordance = overlap_cutoff < sg_bp_overlap[name][bp]
sg_bp_cutoff_member[name][bp] = concordance
if bp not in sg2bp[name] and concordance:
sg_added[name].add(bp)
In [33]:
plt.xlabel('Number of biological processes added to dogmatic subgraph')
plt.ylabel('Frequency')
plt.hist([len(v) for v in sg_added.values()], log=True)
plt.show()
Similarity values between the set of added biological processes and the set of contained biological processes are calculated for each dogmatic mechanism.
Higher similarity means that this approach has lower potential for asserting subgraph expansions, so the values are subtracted from one to preserve monotonicity and improve interpretability.
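The Tanimoto (Jaccard) set similarity is the ratio of intersection to union; a minimal sketch, assuming pbt.utils.tanimoto_set_similarity implements the standard definition:

```python
def tanimoto_set_similarity(x, y):
    """Tanimoto (Jaccard) similarity: |X & Y| / |X | Y|.

    A hypothetical stand-in for ``pbt.utils.tanimoto_set_similarity``.
    """
    x, y = set(x), set(y)
    if not x and not y:
        return 0.0
    return len(x & y) / len(x | y)

# One shared element out of three distinct elements
tanimoto_set_similarity({1, 2}, {2, 3})
```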
In [34]:
sg_concordance = {}
for name in subgraphs:
x = sg_added[name]
y = {bp for bp in bioprocess_nodes if sg_bp_cutoff_member[name][bp]}
sg_concordance[name] = 1 - pbt.utils.tanimoto_set_similarity(x, y)
In [35]:
plt.ylabel('Frequency')
plt.xlabel('Potential for adding biological processes not in the dogmatic subgraph')
plt.hist(list(sg_concordance.values()))
plt.show()
These values indicate which dogmatic subgraphs have the highest potential for expansion, with higher values corresponding to higher potential.
In [36]:
for dm, concordance in sorted(sg_concordance.items(), key=itemgetter(1), reverse=True):
print('{:.2f} {}'.format(concordance, dm))